
Journal of Medical Imaging

SPIE-Intl Soc Optical Eng

Preprints posted in the last 30 days, ranked by how well they match Journal of Medical Imaging's content profile, based on 11 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.

1
Analysis Of Augmentation Techniques for Spine X-Ray Images

Sivakumar, E.; Anand, A.

2026-04-17 radiology and imaging 10.64898/2026.04.15.26350121 medRxiv
Top 0.1%
6.3%

Computer vision and deep learning techniques, including convolutional neural networks (CNNs) and transformers, have improved the performance of medical image classification systems. However, training deep learning models on medical images is challenging and requires a substantial amount of annotated data. In this paper, we implement data augmentation strategies to tackle class imbalance in the VinDr-SpineXR dataset, which contains fewer abnormal spine X-ray images than normal ones. Geometric transformations and synthetic image generation using Generative Adversarial Networks are explored and applied to the abnormal classes of the dataset, and classifier performance is validated using VGG-16 and InceptionNet to identify the most effective augmentation technique. Additionally, we introduce a hybrid augmentation technique that addresses class imbalance, reduces computational overhead relative to a GAN-only approach, and achieves ~99% validation accuracy with both classifiers across all three case studies. Keywords: Data augmentation, Generative Adversarial Network, VGG-16, InceptionNet, Class imbalance, Computer vision, Spine X-ray, Radiology.
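The geometric-transformation side of the approach above can be sketched as follows; this is a minimal illustration on toy data with made-up helper names, not the authors' implementation. It oversamples a minority class with random flips and 90-degree rotations until it reaches a target size:

```python
import numpy as np

def augment_minority(images, target_count, rng=None):
    """Oversample a minority class with simple geometric transforms
    (flips and 90-degree rotations) until target_count is reached."""
    rng = rng if rng is not None else np.random.default_rng(0)
    augmented = list(images)
    while len(augmented) < target_count:
        img = images[int(rng.integers(len(images)))]
        op = int(rng.integers(3))
        if op == 0:
            img = np.fliplr(img)   # horizontal flip
        elif op == 1:
            img = np.flipud(img)   # vertical flip
        else:
            img = np.rot90(img)    # 90-degree rotation
        augmented.append(img)
    return augmented

# Toy example: 3 "abnormal" 4x4 images oversampled to 8
imgs = [np.arange(16, dtype=float).reshape(4, 4) + i for i in range(3)]
balanced = augment_minority(imgs, target_count=8)
```

In practice such transforms are usually applied on the fly inside a training data loader rather than by materializing the oversampled set, but the balancing idea is the same.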

2
Cross-Scanner Reliability of Brain MRI Foundation Model Embeddings: A Travelling-Heads Study

Navarro-Gonzalez, R.; Aja-Fernandez, S.; Planchuelo-Gomez, A.; de Luis-Garcia, R.

2026-03-25 radiology and imaging 10.64898/2026.03.23.26348808 medRxiv
Top 0.1%
3.7%

Foundation models (FMs) for brain magnetic resonance imaging (MRI) are increasingly adopted as pretrained backbones for clinical tasks such as brain age prediction, disease classification, and anomaly detection. However, if FM embeddings (internal representations) shift systematically across MRI scanners, downstream analyses built on them may reflect acquisition hardware rather than biology. No study has yet quantified this cross-scanner reproducibility. Here, we assess the cross-scanner reliability of brain MRI FM embeddings and investigate which design factors (pretraining strategy, network architecture, embedding dimensionality, and pretraining dataset scale) best explain the observed differences. Using the ON-Harmony travelling-heads dataset (20 participants, eight scanners, three vendors), we evaluate the embeddings of five architecturally diverse FMs and a FreeSurfer morphometric baseline via within- and between-scanner intraclass correlation coefficient (ICC), variance decomposition, and scanner fingerprinting. Reliability spanned the full spectrum: biology-guided models achieved good-to-excellent cross-scanner ICC (AnatCL: 0.970 [95% confidence interval (CI): 0.94, 0.98]; y-Aware: 0.809 [0.63, 0.88]), matching or surpassing FreeSurfer (0.926 [0.83, 0.96]), whereas purely self-supervised models fell below the poor threshold (BrainIAC: 0.453, BrainSegFounder: 0.307, 3D-Neuro-SimCLR: 0.247), with 23-58% of embedding variance attributable to scanner identity. The strongest correlate of cross-scanner reliability among the models evaluated was pretraining strategy: incorporating biological metadata (cortical morphometrics, age) into the contrastive objective produced scanner-robust embeddings, whereas architecture, dimensionality, and dataset scale did not predict reliability.

3
Normal is All You Need: A Symmetry-Informed Inverse Learning Foundation Model for Neuroimaging Diagnostics

Wang, S.; Ayubcha, C.; Hua, Y.; Beam, A.

2026-04-12 radiology and imaging 10.64898/2026.04.10.26350553 medRxiv
Top 0.1%
2.7%

Background: Developing generalizable neuroimaging models is often hindered by limited labeled data, which has led to increased interest in unsupervised inverse learning. Existing approaches often neglect geometric principles and struggle with diverse pathologies. We propose a symmetry-informed inverse learning foundation model to address these shortcomings for robust and efficient anomaly detection in brain MRI. Methods: Our framework employs a reconstruction-to-embedding pipeline, trained exclusively on healthy brain MRI slices. A 2D U-Net uses a novel, symmetry-aware masking strategy to reconstruct a disorder-free slice. Difference maps are embedded into a 1024-dimensional latent space via a Beta-VAE. Anomaly scoring is performed using Mahalanobis distance. We evaluated generalization by fine-tuning on an external lesion dataset, BraTS Africa (SSA), and the ADNI-derived Alzheimer disease cohort (Alz). Results: On the source metastasis (Mets) dataset, the framework achieved high performance (AB1+MSE: 99.28% accuracy, 99.79% sensitivity). Generalization to the external lesion dataset (SSA) was robust, with the Symmetry ROC configuration achieving 91.93% accuracy. Transfer to the Alzheimer dataset (Alz) was more challenging, achieving a peak accuracy of 70.54% with a high false-positive rate, suggesting difficulty in separating subtle, diffuse changes. Conclusion: The symmetry-informed inverse learning framework establishes a robust foundation model for neuroimaging, showing strong performance for focal lesions and successful generalization under domain shift. Limitations in diffuse neurodegeneration underscore the necessity for richer representations and multimodal integration to improve future foundation models.
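The Mahalanobis anomaly-scoring step named in this abstract can be sketched as below; a minimal illustration on synthetic data with hypothetical names, not the authors' pipeline. Embeddings far from the healthy-training distribution receive high scores:

```python
import numpy as np

def mahalanobis_scores(embeddings, reference):
    """Distance of each embedding from the (healthy) reference distribution."""
    mu = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    # small ridge term keeps the covariance invertible
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    diff = embeddings - mu
    # quadratic form sqrt((x - mu)^T S^-1 (x - mu)), one value per row
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(500, 8))                  # stand-in healthy latents
queries = np.vstack([np.zeros(8), np.full(8, 5.0)])  # typical vs. anomalous
scores = mahalanobis_scores(queries, healthy)
```

Thresholding such scores (e.g., at a percentile of the healthy-set distances) turns them into an anomaly decision.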

4
Artificial Intelligence Devices for Image Analysis in Digital Pathology

Matthews, G. A.; Godson, L.; McGenity, C.; Bansal, D.; Treanor, D.

2026-03-26 pathology 10.64898/2026.03.23.26349089 medRxiv
Top 0.1%
1.7%

Background: There is increasing momentum behind the clinical implementation of AI-based software for image analysis in digital pathology. As regulations, standards, and national approaches to the clinical use of AI continue to develop, the marketplace of AI products is expanding and evolving, presenting pathologists with a multitude of devices that offer the potential to improve pathology services. Methods: To maintain pace with this changing AI device landscape, we conducted a comprehensive search for, and analysis of, commercial AI products for image analysis in digital pathology. This included CE-marked and Research Use Only (RUO) products using images with histological stains (e.g., H&E) or immunohistochemical (IHC) labelling. Product information and published clinical validation studies were assessed to understand the quality of supporting evidence on available products, and product details were compiled into a public register: https://osf.io/gb84r/overview. Results: In total, we identified and assessed 90 CE-marked and 227 RUO AI products. We found that AI products for cancer detection in prostate and breast pathology comprised a substantial portion of the marketplace for H&E image analysis, while IHC products were almost exclusively for use in breast cancer. Clinical validation studies on these products have steadily increased; however, we found that published studies were only available for just over half of H&E products and just over a quarter of IHC products. For CE-marked products, the dataset quality and diversity for AI model performance validation was highly variable, and particularly limited for IHC products. Furthermore, only a limited number of products included studies that assessed measures of clinical utility.
Conclusion: As clinical deployment of AI products for image analysis in histopathology grows, there is a need for transparency, rigorous validation, and clear evidence supporting clinical utility and cost-effectiveness. Independent scrutiny of the expanding offering of AI products provides insight into the opportunities and shortcomings in this domain.

5
Hierarchical Barycentric Multimodal Representation Learning for Medical Image Analysis

Qiu, P.; An, Z.; Ha, S.; Kumar, S.; Yu, X.; Sotiras, A.

2026-04-06 neurology 10.64898/2026.04.05.26350202 medRxiv
Top 0.1%
1.7%

Multimodal medical image analysis exploits complementary information from multiple data sources (e.g., multi-contrast Magnetic Resonance Imaging (MRI), Diffusion Tensor Imaging (DTI), and Positron Emission Tomography (PET)) to enhance diagnostic accuracy and support clinical decision making. Central to this process is the learning of robust representations that capture both modality-invariant and modality-specific features, which can then be leveraged for downstream tasks such as MRI segmentation and normative modeling of population-level variation and individual deviations. However, learning robust and generalizable representations becomes particularly challenging in the presence of missing modalities and heterogeneous data distributions. Most existing methods address this challenge primarily from a statistical perspective, yet they lack a theoretical understanding of the underlying geometric behavior, such as how probability mass is allocated across modalities. In this paper, we introduce a generalized geometric perspective for multimodal representation learning grounded in the concept of barycenters, which unifies a broad class of existing methods under a common theoretical perspective. Building on this barycentric formulation, we propose a novel approach that leverages generalized Wasserstein barycenters with hierarchical modality-specific priors to better preserve the geometry of unimodal distributions and enhance representation quality. We evaluated our framework on two key multimodal tasks, brain tumor MRI segmentation and normative modeling, demonstrating consistent improvements over a variety of multimodal approaches. Our results highlight the potential of scalable, theoretically grounded approaches to advance robust and generalizable representation learning in medical imaging applications.

6
A Deployable Explainable Deep Learning System for Tuberculosis Detection from Chest X-Rays in Resource-Constrained High-Burden Settings

Agumba, J.; Erick, S.; Pembere, A.; Nyongesa, J.

2026-04-01 radiology and imaging 10.64898/2026.03.31.26349662 medRxiv
Top 0.1%
1.7%

Objectives: To develop and evaluate a deployable deep learning system with Gradient-weighted Class Activation Mapping (Grad-CAM) for tuberculosis screening from chest radiographs and to assess its classification performance and explainability across desktop and mobile deployment platforms. Materials and methods: This study used publicly available chest X-ray datasets containing Normal and Tuberculosis images. A DenseNet121-based transfer learning model was trained using stratified training, validation, and test splits with data augmentation and class weighting. Model performance was evaluated using accuracy, precision, recall, F1 score, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC). Grad-CAM was used to visualize regions influencing model predictions. The trained model was converted to TensorFlow Lite and deployed in both a Windows desktop application and a Flutter-based mobile application for offline inference and visualization. Results: The model demonstrated strong classification performance on the independent test dataset, with high accuracy and AUC values indicating effective discrimination between Normal and Tuberculosis cases. Grad-CAM visualizations showed that the model focused primarily on anatomically relevant lung regions, particularly the upper and mid-lung fields in Tuberculosis cases. Deployment testing confirmed consistent prediction outputs and Grad-CAM visualizations across both Windows and mobile platforms. Conclusion: The proposed deployable deep learning system with Grad-CAM provides accurate and interpretable tuberculosis screening from chest radiographs and demonstrates feasibility for offline mobile and desktop deployment. This approach has potential as an artificial intelligence-assisted screening and decision support tool in radiology, particularly in resource-limited and remote healthcare settings.

7
Visual Fidelity-Driven Quality Assessment of Medical Image Translation

Bizjak, Z.; Zagar, J.; Spiclin, Z.

2026-03-20 radiology and imaging 10.64898/2026.03.18.26348721 medRxiv
Top 0.1%
1.7%

Automated and reliable image quality assessment (IQA) is essential for safe use of medical image synthesis in critical applications like adaptive radiotherapy, treatment planning, or missing-modality reconstruction, where unnoticed generative artifacts may adversely affect outcomes. We evaluated image-to-image translation quality by coupling large-scale expert visual quality assessment with explainable automated IQA modeling. An adversarial diffusion-based framework, SynDiff, was applied to four cross-modality synthesis tasks, including three inter-MR translations and a CBCT-to-CT translation. Using four-fold cross-validation, ten reference-based and eight no-reference IQA metrics were computed for all synthesized images. Visual IQA ratings were independently collected from thirteen expert raters using a predetermined protocol and a specialized image viewer enabling blinded, randomized six-point Likert scoring. Auto-Sklearn was employed to learn ensemble regression models mapping IQA metrics to visual consensus ratings, with separate models trained on reference-based and no-reference metrics. The models closely reproduced the distribution and ordering of expert ratings, typically within +/- 0.5 Likert points. Reference-based models achieved higher agreement with visual ratings than no-reference models (R^2 0.75 vs. 0.59, respectively), although the latter remained unbiased and informative. Explainability analyses highlighted structure- and contrast-sensitive metrics as key predictors. Overall, the results demonstrate that ensemble regression models can provide transparent, scalable, and clinically meaningful quality control for generative medical imaging.

8
Improving Glioblastoma Classification Using Quantitative Transport Mapping with a Synthetic Data Trained Deep Neural Network

Romano, D. J.; Roberts, A. G.; Weppner, B.; Zhang, Q.; John, M.; Hu, R.; Sisman, M.; Kovanlikaya, I.; Chiang, G. C.; Spincemaille, P.; Wang, Y.

2026-04-01 radiology and imaging 10.64898/2026.03.31.26349864 medRxiv
Top 0.1%
1.7%

Purpose: To develop a deep neural network-based, AIF-free, perfusion estimation method (QTMnet) for improved performance on glioma classification. Methods: A globally defined arterial input function (AIF) is needed to recover perfusion parameters in the two-compartment exchange model (2CXM). We have developed Quantitative Transport Mapping (QTM) to create an AIF-independent estimation method. QTM estimation can be formulated using deep neural networks trained on synthetic DCE-MRI data (QTMnet). Here, we provide a fluid mechanics-based DCE-MRI simulation with exchange between the capillaries and extravascular extracellular space. We implemented tumor ROI generation to morphologically characterize tissue perfusion. We compared our QTMnet implementation with 2CXM in 30 human subjects with gliomas, 15 with low-grade gliomas and 15 with high-grade glioblastomas. Results: QTMnet outperforms (best AUC: 0.973) traditional 2CXM (best AUC: 0.911) in a glioma grading task. Conclusion: The AIF-independent QTMnet estimation provides a quantitative delineation between low-grade and high-grade gliomas.

9
The false positive paradox: Examining real-world clinical predictive performance of FDA-authorized AI devices for radiology using clinical prevalence

Sparnon, E.; Stevens, K.; Song, E.; Harris, R. J.; Strong, B. W.; Bruno, M. A.; Baird, G. L.

2026-03-27 radiology and imaging 10.64898/2026.03.25.26349197 medRxiv
Top 0.1%
1.7%

The present study evaluates the real-world clinical predictive performance of FDA-authorized artificial intelligence (AI) devices used in radiology, focusing on the false positive paradox (FPP) and its implications for clinical practice. To do this, we analyzed publicly available FDA data on AI radiology devices from 2024 and 2025 from 510(k) summaries, demonstrating how diagnostic accuracy metrics like sensitivity and specificity do not necessarily translate into high positive predictive value (PPV) due to the influence of target disease prevalence. We show the importance of disclosing the false discovery rate (FDR) and false omission rate (FOR) and argue that this transparency enables clinicians to select AI systems that balance false positive and false negative costs in a clinically, ethically, and financially appropriate manner. Finally, we provide recommendations for what data should be provided to best serve practices and radiologists.
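The false positive paradox the authors describe follows directly from Bayes' rule; a small, self-contained sketch (function name is illustrative) makes the prevalence effect concrete:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Convert test accuracy metrics into post-test probabilities via Bayes' rule."""
    tp = sensitivity * prevalence              # true positives per patient
    fp = (1 - specificity) * (1 - prevalence)  # false positives per patient
    fn = (1 - sensitivity) * prevalence        # false negatives per patient
    tn = specificity * (1 - prevalence)        # true negatives per patient
    ppv = tp / (tp + fp)   # P(disease | positive); FDR = 1 - PPV
    npv = tn / (tn + fn)   # P(no disease | negative); FOR = 1 - NPV
    return ppv, npv

# A 95%-sensitive, 95%-specific detector at 1% disease prevalence:
ppv, npv = predictive_values(0.95, 0.95, 0.01)
```

At 1% prevalence, even this 95%/95% device yields a PPV of roughly 16%, i.e. a false discovery rate near 84%: most positive flags are wrong, which is exactly why the authors argue for disclosing FDR and FOR alongside sensitivity and specificity.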

10
Impact of simulated MRI artifacts on deep learning-based brain age prediction

Hendriks, J.; Jansen, M. G.; Joules, R.; Pena-Nogales, O.; Elsen, F.; Povolotskaya, A.; Dijsselhof, M. B. J.; Rodrigues, P. R.; Barkhof, F.; Schrantee, A.; Mutsaerts, H.

2026-03-26 radiology and imaging 10.64898/2026.03.24.26349152 medRxiv
Top 0.1%
1.6%

Brain age is a promising biomarker for detecting atypical and pathological brain aging, but its accuracy and reliability depend critically on MRI quality. The impact of common MR image degradations such as motion, ghosting, blurring, and noise on brain age predictions remains unclear. In this study, we systematically assessed the effects of four simulated MRI artifact types, across ten severity levels, on brain age prediction using three widely used deep learning-based algorithms (Pyment, MIDI, MCCQR), in high-quality T1-weighted images of healthy adults (age range 18-85, 54% female). Artifact severity levels (1-10) were generated using a power-function mapping of TorchIO simulation parameters calibrated to the full PondrAI QC visual rating scale (from perfect to severely degraded image quality). Linear mixed-effects models with predicted brain age as dependent variable revealed a significant interaction between algorithm, artifact type, and severity (p<0.001), indicating algorithm-specific sensitivity to artifacts. In artifact-free scans, mean absolute error (MAE) was 4.6 years for MCCQR, 7.1 years for Pyment, and 9.1 years for MIDI. At severity level 10, MAE increased by up to 110% for Pyment, 112% for MCCQR, and 16% for MIDI (motion); and by up to 75% for Pyment, 135% for MCCQR, and 34% for MIDI (ghosting). Blurring had minimal impact at low-moderate levels, but at maximum severity MAE increased by 26% for Pyment and 137% for MCCQR, while MIDI remained largely stable. Noise minimally affected Pyment and MCCQR (MAE increases ≤20%), but led to larger declines for MIDI (MAE increase 35%). The vulnerability of different algorithms highlights that training data, preprocessing strategies and underlying architectures influence robustness, emphasizing that artifact sensitivity is a key consideration when interpreting brain age as a biomarker.
Our results emphasize the need for artifact-aware evaluation and mitigation strategies when brain-age algorithms are used in clinical research.

11
A Cohort-Based Global Sensitivity Benchmark of MRI-Derived Whole-Heart Electromechanical Models in Healthy Hearts

Rahmani, S.; Pouliopoulos, J.; W. C. Lee, A.; Barrows, R. K.; Solis-Lemus, J. A.; Strocchi, M.; Rodero, C.; Qayyum, A.; Lashkarinia, S.; Roney, C.; Augustin, C. M.; Plank, G.; Fatkin, D.; Jabbour, A.; Niederer, S. A.

2026-03-30 systems biology 10.64898/2026.03.27.714701 medRxiv
Top 0.1%
1.6%

Patient-specific four-chamber electromechanical models provide a physics-constrained framework for investigating whole-heart cardiac physiology and disease mechanisms. Identifying which model parameters impact whole-heart function is important for understanding cellular-, tissue-, and organ-scale determinants of cardiac performance and for calibrating patient-specific models. However, previous global sensitivity analyses of cardiac electromechanical models have typically been performed on a single heart, and systematic evaluation of how parameter influence compares across anatomically different subjects remains limited. We created four-chamber electromechanical models using cardiac MRI from five healthy subjects (n = 5). The models simulated atrial and ventricular cellular electrophysiology, calcium dynamics, and active contraction, with heterogeneous fibre orientation, transversely isotropic tissue mechanics, pericardial constraint, and a closed-loop cardiovascular system providing physiological boundary conditions. In total, 46 parameters described the integrated model. Using Gaussian process emulators, we performed multi-scale global sensitivity analysis to evaluate the relative contribution of model parameters to left and right atrial and ventricular function. Across all anatomies, the most influential parameters were systemic and pulmonary resistances, ventricular end-diastolic pressures, and the venous reference pressure, highlighting the dominant role of haemodynamic loading conditions in governing pressure- and volume-based outputs. A chamber-level analysis of atrioventricular coupling revealed a phase-dependent pattern. Atrial pressures were predominantly governed by global haemodynamic parameters (> 90% of total sensitivity), atrial filling volumes showed substantial ventricular influence (≈40-55% across anatomies), and atrial end-systolic volumes were primarily determined by intrinsic atrial parameters (≈60-65%).
These patterns were consistent across subjects despite differences in anatomy. We show that, in healthy male subjects, inter-individual anatomical variation does not substantially change the ranking of dominant parameters. This work provides a repeatable modelling and sensitivity analysis framework and establishes a benchmark reference for whole-heart electromechanical modelling in healthy hearts. Author summary: Computational models of the heart can simulate cardiac physiology in unprecedented detail, but these models contain many parameters whose influence on predicted function is not fully understood. We built patient-specific four-chamber heart models from MRI scans of five healthy subjects and used statistical methods to systematically test how 46 model parameters affect simulated cardiac performance. Across all five subjects, we found that the haemodynamic loading parameters, including systemic and pulmonary vascular resistance, ventricular filling pressures, and the venous reference pressure, consistently had the greatest influence on the model outputs, regardless of differences in individual heart anatomy. This finding suggests that in healthy resting conditions, the boundary conditions of the cardiovascular system, rather than individual differences in heart geometry or electrical properties, are the primary drivers of whole-heart function. We also found a structured coupling pattern between the upper and lower heart chambers, where global haemodynamic parameters dominate atrial pressure regulation, ventricular mechanics shape atrial filling, and intrinsic atrial properties control atrial emptying. This work provides a benchmark dataset of five anatomically detailed heart models and a sensitivity analysis framework to guide calibration of future cardiac digital twin models.

12
A Systematic Performance Evaluation of Three Large Language Models in Answering Questions on Moderate Hyperthermia

Dennstaedt, F.; Cihoric, N.; Bachmann, N.; Filchenko, I.; Berclaz, L.; Crezee, H.; Curto, S.; Ghadjar, P.; Huebenthal, B.; Hurwitz, M. D.; Kok, P.; Lindner, L. H.; Marder, D.; Molitoris, J.; Notter, M.; Rahman, S.; Riesterer, O.; Spalek, M.; Trefna, H.; Zilli, T.; Rodrigues, D.; Fuerstner, M.; Stutz, E.

2026-03-26 oncology 10.64898/2026.03.25.26349254 medRxiv
Top 0.2%
1.2%

Background: Large Language Models (LLMs) have demonstrated expert-level performance across many medical domains, suggesting potential utility in clinical practice. However, their reliability in the highly specialized domain of moderate hyperthermia (HT) remains unknown. We therefore evaluated the performance of three modern LLMs in answering HT-related questions. Methods: We conducted an evaluation study by posing 40 open-ended questions (22 clinical and 18 physics-related) to three modern LLMs (DeepSeek-V3, Llama-3.3-70B-Instruct, and GPT-4o). Responses were blinded, randomized, and evaluated by 19 international experts with either a clinical or physics background for quality (5-point Likert scale: 1=very bad, 2=bad, 3=acceptable, 4=good, 5=very good) and for potential harmfulness in clinical decision-making. Results: A total of 1144 quality evaluation responses were collected. Overall reported mean quality scores were similar across models, with DeepSeek scoring 3.26, Llama 3.18, and GPT-4o 3.07, corresponding to an "acceptable" rating. Across expert evaluations, responses were considered potentially harmful in 17.8% of cases for DeepSeek, 19.3% for Llama, and 15.3% for GPT-4o. Notably, despite "acceptable" mean scores, approximately 25% of responses were rated "bad" to "very bad," and potentially harmful answers occurred in ~15-19% of evaluations, indicating a non-trivial risk if used without domain expertise. Conclusion: Our findings indicate that the performance of LLMs in HT, in versions available at the time of investigation, is only partially satisfactory. The proportion of poor-quality responses is too high and may lead non-domain experts to misinterpret the available clinical evidence and draw inappropriate clinical conclusions.

13
AI-Assisted Pneumonia Detection, Localisation and Report Generation from Chest X-rays

Boiardi, F. E.; Lain, A. D.; Posma, J. M.

2026-03-23 radiology and imaging 10.64898/2026.03.20.26348879 medRxiv
Top 0.2%
1.2%

Pneumonia detection in chest X-rays (CXRs) is complicated by high inter-observer variability and overlapping radiographic patterns. While deep learning (DL) solutions show promise, limitations in generalisability and explainability hinder clinical adoption. We address these challenges by introducing a holistic DL-based computer-aided diagnosis (CAD) pipeline for pneumonia detection, localisation, and structured report generation from CXRs. We curated the largest composite of publicly available CXRs to date (N=922,634), of which [Formula] were used for training. MIMIC-CXR radiology reports were relabelled using a local large language model (LLM), positing that LLM-derived pneumonia labels would yield higher diagnostic sensitivity than the provided rule-based natural language processing (rNLP) labels. DenseNet-121 classifiers were trained on four configurations: MIMIC-CXR (rNLP), MIMIC-CXR (LLM), and each supplemented with VinDr-CXR data. Gradient-weighted Class Activation Mapping (Grad-CAM) provided visual explainability and lung zone-based localisation. LLM-driven relabelling significantly improved human-label agreement (96.5% vs 72.5%, P=1.66×10⁻¹¹). The best-performing model (MIMIC-CXR (LLM) + VinDr-CXR) achieved 82.08% sensitivity and 81.97% precision, surpassing both radiologist sensitivity ranges (64-77.7%) and CheXNet's pneumonia F1-score (43.5%). Grad-CAM localisation attained a moderate F1-score of 52.9% (sensitivity=65.7%, precision=44.3%), confirming focus alignment with pathological lung regions while highlighting areas for refinement. These findings demonstrate that LLM-driven label curation, combined with DL, can exceed conventional rNLP and radiologist performance, advancing high-quality data integration in predictive medical imaging. Clinically, our pipeline offers rapid triage, automated report drafting, and real-time pneumonia surveillance: tools that can streamline radiology workflows and mitigate diagnostic errors.

14
Data Matters: The Impact of Data Curation in the Classification of Histopathological Datasets

Brito-Pacheco, D. A.; Giannopoulos, P.; Reyes-Aldasoro, C. C.

2026-04-17 pathology 10.64898/2026.04.16.26351016 medRxiv
Top 0.2%
1.2%

In this work, the impact of outliers on the performance of machine learning and deep learning models is investigated, specifically for the case of histopathological images of colorectal cancer stained with Haematoxylin and Eosin. The evaluation of the impact is done through the systematic comparison of one machine learning model (Random Forests) and one deep learning model (ResNet-18). Both models were trained with the popular NCT-CRC-HE-100K dataset and tested on the CRC-VAL-HE-7K companion set. Then, a curation process was performed by analysing the divergence of patches based on chromatic, textural and topological features of the training set and removing outliers to repeat the training with a cleaned dataset. The results showed that machine learning models can benefit more from improvements in data quality than deep learning models. Further, the results suggest that deep learning models are more robust to outliers, as through the training process the architectures can learn features other than those previously mentioned.

15
Information-Guided Parameter Optimisation for MR Elastography Radiomics

Djebbara, I.; Yin, Z.; Friismose, A. I.; Poulsen, F. R.; Hojo, E.; Aunan-Diop, J. S.

2026-03-20 radiology and imaging 10.64898/2026.03.17.26348578 medRxiv
Top 0.2%
1.2%

Mechanical properties of biological tissues vary across spatial scales, yet radiomics typically relies on fixed, heuristic choices for neighbourhood size, kernel geometry, and spectral content - choices that can silently reshape the feature space before any modelling begins. We introduce a label-free, information-theoretic framework for selecting extraction parameters in multi-frequency MRE radiomics. For each configuration θ - neighbourhood radius r, kernel geometry k (sphere or shell), and frequency subset f - we extract a radiomics feature matrix and score it using an objective J(θ) that integrates distributional richness (Shannon entropy), cross-frequency coherence (canonical correlation), inter-feature redundancy (Spearman correlation), and bootstrap stability. We evaluate 121 configurations per tissue in multi-frequency MRE (30-60 Hz) of human brain, liver, and a calibrated phantom, and test robustness using 10,000 Dirichlet-sampled objective weightings. Across tissues, neighbourhood aggregation is consistently preferred over voxel-wise extraction, outperforming the no-neighbourhood baseline in 98.4-100% of weightings. External validation in 100 independent brain scans acquired with a different protocol and wider frequency range (20-90 Hz) confirms a reproducible mesoscopic plateau at r = 3-5 (9-15 mm), with a modal optimum at r = 4; omitting neighbourhood analysis reduces J(θ) by 38% relative to each subject's optimum. Frequency-subset preferences replicate across datasets, with lower frequencies most frequently selected for brain. By turning ad hoc extraction choices into an outcome-free optimisation step, this framework improves reproducibility, reduces sensitivity to heuristic parameter choices, and generalises across acquisition protocols and imaging sites.
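A toy version of such a label-free objective can be sketched as below, assuming only that distributional richness is rewarded and inter-feature redundancy penalized. This sketch uses Pearson correlation in place of the paper's Spearman and omits the coherence and stability terms; all names are illustrative:

```python
import numpy as np

def score_features(F, bins=16):
    """Toy label-free objective: mean per-feature histogram entropy
    minus mean absolute pairwise correlation (redundancy proxy)."""
    ent = 0.0
    for col in F.T:
        p, _ = np.histogram(col, bins=bins)
        p = p / p.sum()
        p = p[p > 0]
        ent += -(p * np.log2(p)).sum()   # Shannon entropy in bits
    ent /= F.shape[1]
    C = np.corrcoef(F, rowvar=False)
    off_diag = np.abs(C[~np.eye(C.shape[0], dtype=bool)]).mean()
    return ent - off_diag  # higher = richer, less redundant feature set

rng = np.random.default_rng(0)
diverse = rng.normal(size=(200, 5))                       # independent features
redundant = np.repeat(rng.normal(size=(200, 1)), 5, 1)    # 5 copies of one
```

Scoring every candidate configuration with such an objective and picking the maximizer is the outcome-free optimisation step the abstract describes; here `score_features(diverse)` exceeds `score_features(redundant)`.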

16
HybridNet-XR: Efficient Teacher-Free Self-Supervised Learning for Autonomous Medical Diagnostic Systems in Resource-Constrained Environments.

Mayala, S.; Mzurikwao, D.; Suluba, E.

2026-03-19 health informatics 10.64898/2026.03.16.26348570 medRxiv
Top 0.2%
1.1%

Deep learning classification on large datasets is often limited in countries with restricted computational resources. While transfer learning can offset these limitations, standard architectures often maintain a high memory footprint. This study introduces HybridNet-XR, a memory-efficient and computationally lightweight hybrid convolutional neural network (CNN) designed to bridge the domain gap in medical radiography using autonomous self-supervised learning protocols. The HybridNet-XR architecture integrates depthwise separable convolutions for parameter reduction, residual connections for gradient stability, and aggressive early downsampling to minimize the video RAM (VRAM) footprint. We evaluated several training paradigms, including teacher-free self-supervised learning (SSL-SimCLR), teacher-led knowledge distillation (KD), and domain-gap (DG) adaptation. Each variant was pre-trained on ImageNet-1k subsets and fine-tuned on the ChestX6 multi-class dataset. Model interpretability was validated through gradient-weighted class activation mapping (Grad-CAM). The performance frontier analysis identified the HybridNet-XR-150-PW (Pre-warmed) as the optimal configuration, achieving a 93.38% average accuracy and 99% AUC while utilizing only 814.80 MB of VRAM. Regarding class-wise accuracy, this variant significantly outperformed standard MobileNetV2 and teacher-led models in critical diagnostic categories, notably COVID-19 (97.98%) and Emphysema (96.80%). Grad-CAM visualizations confirmed that the teacher-free pre-warming phase allows the model to develop sharper, anatomically grounded focus on pathological landmarks compared to distilled models. Specialized pre-warming schedules offer a viable, computationally autonomous alternative to knowledge distillation for medical imaging.
By eliminating the requirement for high-performance teacher models, HybridNet-XR provides a robust and trustworthy diagnostic foundation suitable for clinical deployment in resource-constrained environments. Author summary: Traditional deep learning models for medical imaging are often too large for the low-power computers available in many global health settings. We developed a new model to bridge this computational gap. We designed HybridNet-XR, a highly efficient AI architecture, and trained it using a "teacher-free" method that doesn't require a massive supercomputer. We found a specific version (H-XR150-PW) that provides high accuracy while using very little memory. Our results show that high-performance diagnostic AI can be deployed on standard, low-cost hardware. Furthermore, using visual heatmaps (Grad-CAM), we proved that the AI correctly identifies medical landmarks like lung opacities, ensuring it is safe and reliable for real-world clinical use.
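The parameter savings behind the depthwise separable convolutions mentioned in this abstract can be shown with a quick back-of-the-envelope calculation. This is a generic sketch, not the authors' HybridNet-XR code, and the layer sizes below are hypothetical:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """One k x k depthwise filter per input channel, followed by a
    1 x 1 pointwise convolution that mixes channels."""
    return c_in * k * k + c_in * c_out

# Hypothetical layer: 64 -> 128 channels with 3 x 3 kernels
standard = conv_params(64, 128, 3)                  # 73,728 weights
separable = depthwise_separable_params(64, 128, 3)  # 8,768 weights
ratio = standard / separable                        # roughly 8x fewer
```

For this hypothetical layer the separable factorization cuts the weight count by roughly a factor of eight, which is the kind of reduction that makes a small VRAM footprint plausible.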

17
CorSeg-CineSAX: An Open-Source Deep Learning Framework for Fully Automatic Segmentation of Short-Axis Cine Cardiac MRI Across Multiple Cardiac Diseases

Xu, R.; Jiang, S.; Zhai, Y.; Chen, Y.

2026-04-03 cardiovascular medicine 10.64898/2026.04.01.26349955 medRxiv
Top 0.2%
1.0%

Background: Segmentation of the left ventricular myocardium, left ventricular cavity, and right ventricular cavity on short-axis cine cardiac magnetic resonance (CMR) images is essential for quantifying cardiac structure and function. However, existing automated segmentation tools are limited by small training datasets, narrow disease coverage, restrictive input format requirements, and the absence of anatomical plausibility constraints, hindering their clinical adoption. Methods: We constructed the largest annotated CMR short-axis segmentation dataset to date, comprising 1,555 subjects from 12 centers with five cardiac disease types and full cardiac cycle annotations totaling 319,175 labeled images. A MedNeXt-L model was trained using a 2D slice-by-slice strategy with full field-of-view input, eliminating dependencies on 3D volumes, temporal sequences, or region-of-interest (ROI) localization. A deterministic three-step post-processing pipeline was designed to enforce anatomical priors: a connected-component constraint, a containment-relationship constraint, and a gap-filling constraint. The model was validated on an internal test set (310 subjects) and three independent public external datasets (ACDC, M&Ms1, M&Ms2; 855 subjects from 6 additional centers across 3 countries), spanning 15 cardiac disease categories, 10 of which were never encountered during training. Results: The model achieved mean Dice similarity coefficients (DSC) of 0.913 ± 0.037 and 0.911 ± 0.040 on internal and external test sets, respectively, with a cross-domain performance gap of only 0.002. Post-processing eliminated all containment violations (7.5% → 0%) and gap errors (1.8% → 0%) while reducing fragment rates by 85.5% (9.0% → 1.3%). Zero-shot generalization to 10 unseen disease categories yielded DSC values ranging from 0.899 to 0.921.
Automated clinical functional parameters demonstrated excellent agreement with manual measurements for left ventricular indices and right ventricular volumes (intraclass correlation coefficients ≥ 0.977). Conclusions: CorSeg-CineSAX provides a robust, open-source framework for fully automatic CMR short-axis segmentation across diverse clinical scenarios. All source code and pre-trained weights are publicly available at https://github.com/RunhaoXu2003/CorSeg.
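The Dice similarity coefficient (DSC) reported above is the standard overlap metric for segmentation: twice the intersection of two masks divided by the sum of their sizes. A minimal sketch on flat binary masks, not taken from the CorSeg-CineSAX pipeline:

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks, given as
    equal-length flat sequences of 0/1 (or bool) labels."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(map(bool, mask_a)) + sum(map(bool, mask_b))
    # Two empty masks agree perfectly by convention
    return 2 * inter / total if total else 1.0

# Toy example: prediction misses one of two foreground pixels
pred  = [1, 0, 0, 0]
truth = [1, 1, 0, 0]
score = dice(pred, truth)  # 2*1 / (1+2) = 0.667
```

In practice the masks are 2D or 3D label arrays flattened per class; the formula is identical, which is why DSC transfers cleanly across the internal and external test sets described here.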

18
Discordance in pleural mesothelioma response classification and modelling of impact on clinical trials

Cowell, G. W.; Roche, J.; Noble, C.; Stobo, D. B.; Papanastasiou, A.; Kidd, A. C.; Tsim, S.; Blyth, K. G.

2026-03-20 oncology 10.64898/2026.03.18.26348731 medRxiv
Top 0.3%
0.9%

Introduction: Agreement between radiologists regarding treatment response in Pleural Mesothelioma (PM) is acknowledged to be poor, but downstream effects in clinical trials have not been quantified. Methods: We performed a mixed-methods study, composed of a multicentre, retrospective cohort study and in silico modelling. CT images and data were retrieved from 4 UK centres regarding chemotherapy-treated patients. Expert radiologists classified response using modified Response Evaluation Criteria In Solid Tumours (mRECIST) v1.1, generating discordance rate (%) and agreement. In silico modelling simulated two-arm trials of an active therapy with intended 80% power and confidence intervals for four endpoints (objective response rate (ORR), disease control rate (DCR), progression-free survival (PFS), overall survival (OS)) covering 95% of the true effect. Actual power and endpoint coverage were modelled against mRECIST misclassification rate (a single-reporter equivalent of discordance rate). Consecutive simulations varied misclassification rate from 0-100% in 1% increments, each repeated 10,000 times. Results: 172 cases were included. Discordance rate was 35% (60/172), kappa=0.456. In silico modelling demonstrated reduced power and endpoint precision with increasing misclassification. At 17% misclassification, corresponding to the observed 35% discordance, power dropped from 80% to 55% for ORR, 53% for DCR, 65% for PFS and 66% for OS, with endpoint coverage reduced to 88%, 89%, 92% and 92%, respectively. 50/60 (83%) discordances reflected interpretation or measurement differences intrinsic to mRECIST. Discordance was not associated with tumour volume. Conclusions: Inconsistent response classification is common in PM and substantially reduces statistical power and endpoint precision in clinical trials.
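The core idea of the in silico modelling, that random misclassification of responses erodes statistical power, can be reproduced with a small Monte Carlo sketch. This is a generic illustration, not the authors' simulation: the effect sizes, sample size, and two-proportion z-test below are hypothetical assumptions, not values from the study:

```python
import random

def simulate_power(true_orr_active=0.40, true_orr_control=0.20,
                   n_per_arm=100, misclass_rate=0.0,
                   n_sims=2000, seed=0):
    """Monte Carlo power estimate for a two-arm ORR comparison in which
    each response call is randomly flipped at the misclassification rate."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% test, normal approximation
    hits = 0
    for _ in range(n_sims):
        def arm(p):
            responders = 0
            for _ in range(n_per_arm):
                r = rng.random() < p            # true response status
                if rng.random() < misclass_rate:
                    r = not r                   # reader misclassifies
                responders += r
            return responders
        x1, x2 = arm(true_orr_active), arm(true_orr_control)
        p1, p2 = x1 / n_per_arm, x2 / n_per_arm
        p_pool = (x1 + x2) / (2 * n_per_arm)
        se = (2 * p_pool * (1 - p_pool) / n_per_arm) ** 0.5
        if se > 0 and abs(p1 - p2) / se > z_crit:
            hits += 1
    return hits / n_sims

power_clean = simulate_power(misclass_rate=0.00)
power_noisy = simulate_power(misclass_rate=0.30)  # power degrades
```

Flipping calls at rate m shrinks the observed arm difference by a factor of (1 - 2m), so power falls even though neither the drug effect nor the sample size changed, which is the mechanism behind the power losses reported above.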

19
VIsual STAndardized Quantification of LGE (VISTAQ), a contour-less method for late gadolinium enhancement quantification

Aquaro, G. D.; Licordari, R.; De Gori, C.; Todiere, G.; Ianni, U.; Barison, A.; De Luca, A.; Folgheraiter, A.; Grigoratos, C.; Alberti, M.; Lombardo, M.; De Caterina, R.; Sinagra, G.; Emdin, M.; Di Bella, G.; Fulceri, L.

2026-04-15 cardiovascular medicine 10.64898/2026.04.09.26350552 medRxiv
Top 0.3%
0.9%

Background: Late gadolinium enhancement (LGE) quantification by cardiovascular magnetic resonance is central to risk stratification in hypertrophic cardiomyopathy (HCM), yet conventional techniques require contour tracing and region-of-interest (ROI) placement, which may reduce reproducibility and increase analysis time. We developed a novel visual standardized approach, the Visual Standardized Quantification of LGE (VISTAQ), that does not require myocardial contouring, arbitrary ROI positioning, or dedicated post-processing software. Methods: In this multicenter, multivendor retrospective study, LGE images from 400 patients (100 prior myocardial infarction, 250 HCM, 50 other non-ischemic heart diseases) were analyzed. VISTAQ subdivides each myocardial segment into transmural mini-segments and classifies LGE visually using predefined criteria, expressing global LGE burden as the percentage of positive mini-segments. Reproducibility was assessed in 250 patients across different observer expertise levels using intraclass correlation coefficients (ICC) and Bland-Altman analysis. In 100 HCM patients, VISTAQ was compared with conventional methods (mean+2SD, +5SD, +6SD, FWHM, visual thresholding). Prognostic performance was evaluated in 250 HCM patients over a median 5-year follow-up. Results: VISTAQ demonstrated excellent intra- and inter-observer reproducibility (ICC up to 0.98 and 0.97, respectively), consistent across disease subtypes. Compared with conventional techniques, VISTAQ showed similar ICC to FWHM but significantly lower net and absolute inter-observer differences (median absolute difference 1.3%). Mean+2SD markedly overestimated LGE, whereas mean+6SD slightly underestimated LGE compared with VISTAQ, mean+5SD, FWHM, and visual thresholding. Analysis time was substantially shorter with VISTAQ (median 105 vs. 375 seconds, p<0.0001). During follow-up, 21 hard cardiac events occurred in the HCM population.
An LGE threshold >10% predicted events with higher accuracy using VISTAQ (AUC 0.90; sensitivity 85%; specificity 94%) compared with mean+6SD (AUC 0.75; sensitivity 57%; specificity 93%). Conclusions: VISTAQ provides highly reproducible, time-efficient LGE quantification without dedicated software and demonstrates non-inferior prognostic discrimination in HCM compared with conventional threshold-based techniques.
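The VISTAQ burden score described above is, arithmetically, just the fraction of mini-segments visually classified as LGE-positive. A minimal sketch of that bookkeeping; the count of four mini-segments per segment is a hypothetical assumption, since the abstract does not state the subdivision:

```python
def lge_burden(mini_segments):
    """Global LGE burden as the percentage of mini-segments visually
    classified as LGE-positive (True) versus negative (False)."""
    if not mini_segments:
        raise ValueError("no mini-segments provided")
    return 100.0 * sum(mini_segments) / len(mini_segments)

# Hypothetical read: 17 myocardial segments x 4 transmural
# mini-segments each, of which 9 are judged LGE-positive
reads = [True] * 9 + [False] * (17 * 4 - 9)
burden = lge_burden(reads)  # 9/68 of the myocardium, about 13.2%
```

Because the score is a simple proportion of categorical visual calls rather than a pixel-intensity threshold, it needs no contours or dedicated software, which is what drives the reproducibility and speed advantages reported here.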

20
A Novel Fixel-Based Approach for Resolving Neonatal White Matter Microstructure from Clinical Diffusion MRI

Newman, B.; Puglia, M. H.

2026-03-23 neurology 10.64898/2026.03.17.26348387 medRxiv
Top 0.3%
0.8%

Introduction: Preterm birth is a major risk factor for disrupted brain development and subsequent neurodevelopmental disorders, yet the underlying mechanisms remain poorly understood. Further, typical neuroimaging analyses are particularly challenging in the neonatal brain: data are frequently of low quality, and incomplete cellular development violates the assumptions relied on by many commonly used techniques. In this study, we develop and present an advanced diffusion magnetic resonance imaging method to examine the microstructural organization of white matter in a clinically acquired cohort of premature neonates. Methods: Using a novel approach that resolves multiple tissue compartments within the brain, we provide highly detailed orientation and quantification of white matter fibers and tissue signal fraction. We also utilize a series of automated segmentation algorithms to identify and measure these metrics across key tracts and subcortical regions. We investigate how these measures relate to postmenstrual age, as well as to clinical factors reflecting neonatal illness severity. Results: We report successful segmentation and reconstruction of numerous white matter tracts throughout the neonatal brain. We further demonstrate the utility and functionality of microstructural analysis in a variety of pathologies commonly encountered in the neonatal clinical environment. Our results demonstrate tract-specific developmental trajectories, with early-maturing pathways showing higher microstructural organization. Exploratory analyses suggest that neonatal illness severity has modest, tissue-specific associations with microstructural properties. Discussion: This work demonstrates that advanced microstructural imaging methods can extract meaningful white matter measurements from clinically acquired scans, providing a practical framework for studying neonatal brain development in real-world hospital settings.
These metrics can be calculated at extremely young ages, potentially allowing non-invasive study of vulnerable populations before detailed behavioral or neurological assessments are feasible.